MEM T380

Case Studies Group 10

Ante Sokosa
Ziad Hatab

HW2A+B


PART B, Decision Tree, STARTS AT CELL [63]

If viewing the HTML export and cell numbers are not visible, use CTRL+F to search for "Decision Tree"


In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn import preprocessing
# set seaborn's default settings
sns.set()

Import Data

We import each subset into its own dataframe for later use, and also concatenate them into a single dataframe for the current analysis.

In [ ]:
excel_file = 'weld_defect_dataset.xlsx'
subsets = []

for i in range(1, 6):
    subset = pd.read_excel(excel_file, sheet_name = 'subset' + str(i))
    subset = subset.rename(columns={'Type':'type',
                                    'W':'w',
                                    'Ar':'ar',
                                    'Sp':'sp',
                                    'Re':'re',
                                    'Rr':'rr',
                                    'Sk':'sk',
                                    'Ku':'ku',
                                    'Hc':'hc',
                                    'Rc':'rc',
                                    'Sc ':'sc',
                                    'Kc ':'kc'}) # note: the trailing spaces after 'Sc' and 'Kc' are naming errors in the Excel file, corrected here for ease of use later
    subsets.append(subset)

subsetsall = pd.concat(subsets, ignore_index=True)
subsetsall
Out[ ]:
type w ar sp re rr sk ku hc rc sc kc
0 PO 0.008596 0.006897 0.5748 0.838397 0.998562 0.091802 0.908459 0.003151 0.111302 0.256742 0.389952
1 PO 0.010029 0.003448 0.4112 0.838397 0.649317 0.039172 0.476520 0.002817 0.121299 0.332611 0.443785
2 PO 0.007163 0.003448 0.4400 1.007173 0.754309 0.048079 0.766430 0.002621 0.127759 0.323068 0.444515
3 PO 0.028653 0.003448 0.3124 0.534599 0.061617 0.244800 0.789110 0.010007 0.092632 0.220312 0.339685
4 PO 0.018625 0.003448 0.4024 0.557089 0.037346 0.578774 0.630554 0.006757 0.073914 0.270908 0.273045
... ... ... ... ... ... ... ... ... ... ... ... ...
215 CR 0.277937 0.949262 1.0268 0.102869 0.723013 0.025025 0.468658 0.101296 0.757683 0.231426 0.516244
216 CR 0.148997 0.720690 0.8172 0.055527 0.509504 0.135456 0.551284 0.010890 0.262126 0.410800 0.530843
217 CR 0.320917 0.846359 0.7100 0.106793 0.407912 0.027538 0.488077 0.191586 0.757547 0.158517 0.559012
218 CR 0.322350 0.578386 0.6420 0.143629 0.384393 0.039732 0.492730 0.154902 0.640716 0.218541 0.567931
219 CR 0.372493 0.799686 0.8580 0.167046 0.235256 0.075930 0.558360 0.268964 0.637409 0.164191 0.586349

220 rows × 12 columns

For Reference:

image.png

We now have individual dataframes for each subset of data, as well as one large dataframe for current use / data exploration where applicable:

In [ ]:
print(subsets[0].shape)
print(subsets[4].shape)
print(subsetsall.shape)
(44, 12)
(44, 12)
(220, 12)

With .info() we see that the data is in a very clean format, with no missing values (220 non-null), and all data types are correct.

In [ ]:
subsetsall.info() 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 220 entries, 0 to 219
Data columns (total 12 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   type    220 non-null    object 
 1   w       220 non-null    float64
 2   ar      220 non-null    float64
 3   sp      220 non-null    float64
 4   re      220 non-null    float64
 5   rr      220 non-null    float64
 6   sk      220 non-null    float64
 7   ku      220 non-null    float64
 8   hc      220 non-null    float64
 9   rc      220 non-null    float64
 10  sc      220 non-null    float64
 11  kc      220 non-null    float64
dtypes: float64(11), object(1)
memory usage: 20.8+ KB

Alternatively, we can do so with .dtypes and .isna().sum().sum():

In [ ]:
subsetsall.dtypes
Out[ ]:
type     object
w       float64
ar      float64
sp      float64
re      float64
rr      float64
sk      float64
ku      float64
hc      float64
rc      float64
sc      float64
kc      float64
dtype: object
In [ ]:
subsetsall.isna().sum().sum()
Out[ ]:
0

We can also use .describe() to get a quick overview of the data:

In [ ]:
subsets[0].describe()
Out[ ]:
w ar sp re rr sk ku hc rc sc kc
count 44.000000 44.000000 44.000000 44.000000 44.000000 44.000000 44.000000 44.000000 44.000000 44.000000 44.000000
mean 0.193312 0.219518 0.596164 0.408728 0.244100 0.190435 0.618382 0.114831 0.227141 0.222136 0.468713
std 0.220107 0.260694 0.217955 0.274365 0.223650 0.171583 0.145415 0.157291 0.209002 0.113051 0.138047
min 0.007163 0.003448 0.192400 0.004051 0.001971 0.011205 0.269329 0.001358 0.032748 0.002616 0.100178
25% 0.041189 0.014874 0.415600 0.145802 0.087699 0.080167 0.536612 0.017436 0.110421 0.142772 0.412662
50% 0.088109 0.062179 0.551800 0.492258 0.169426 0.115660 0.585224 0.038245 0.155732 0.215720 0.468566
75% 0.280444 0.393534 0.830500 0.599842 0.330571 0.243957 0.672636 0.131052 0.218242 0.292278 0.545424
max 1.000000 0.826724 0.928000 1.007173 0.998562 0.681613 1.113649 0.617477 1.001281 0.571364 0.911416
In [ ]:
subsets[4].describe()
Out[ ]:
w ar sp re rr sk ku hc rc sc kc
count 44.000000 44.000000 44.000000 44.000000 44.000000 44.000000 44.000000 44.000000 44.000000 44.000000 44.000000
mean 0.175860 0.220914 0.623318 0.391490 0.243808 0.190940 0.582134 0.080576 0.219547 0.225331 0.502365
std 0.188479 0.276367 0.209529 0.277337 0.203288 0.221725 0.163183 0.095121 0.180705 0.152141 0.164504
min 0.007163 0.003448 0.157200 0.002152 0.000200 0.001608 0.168895 0.002135 0.013219 0.001474 0.155346
25% 0.038682 0.009670 0.438300 0.135696 0.093246 0.074033 0.495114 0.015487 0.099992 0.131640 0.416361
50% 0.090974 0.059112 0.638000 0.437933 0.199206 0.132192 0.563511 0.035016 0.178824 0.213700 0.498491
75% 0.263968 0.362931 0.783000 0.606814 0.334925 0.158128 0.643986 0.125883 0.265546 0.309116 0.582118
max 0.816619 0.949262 1.026800 1.007173 0.817756 1.002376 1.128828 0.378949 0.757683 0.729507 0.990413
In [ ]:
subsetsall.describe()
Out[ ]:
w ar sp re rr sk ku hc rc sc kc
count 220.000000 220.000000 220.000000 220.000000 220.000000 220.000000 220.000000 220.000000 220.000000 220.000000 220.000000
mean 0.175905 0.207577 0.599259 0.392519 0.249057 0.167519 0.604765 0.092316 0.216053 0.240740 0.478316
std 0.192450 0.256669 0.216377 0.265337 0.208797 0.164088 0.150983 0.138605 0.173100 0.143031 0.150625
min 0.001433 0.003448 0.025200 0.000591 0.000118 0.001608 0.168895 0.000013 0.004129 0.001474 0.028573
25% 0.035817 0.011860 0.415600 0.131772 0.086298 0.065242 0.519888 0.012539 0.107777 0.145720 0.372002
50% 0.078080 0.062179 0.586600 0.412764 0.213024 0.113033 0.571244 0.033709 0.158049 0.216357 0.481886
75% 0.277937 0.362931 0.826200 0.604219 0.339437 0.198047 0.670390 0.111988 0.254074 0.317875 0.572017
max 1.000000 1.037931 1.026800 1.007173 1.003975 1.002376 1.202949 1.049198 1.001281 1.000876 1.025173

From the three .describe() calls, it can be seen that all subsets have similar data, so for further data exploration we will use the full concatenated dataset.

In [ ]:
# each original subset is 44 rows
# original subsetsall is 220 rows (44*5)

for i in range(5):
    subsets[i].drop_duplicates(inplace=True)
    print(subsets[i].shape)

print(subsetsall.shape)
subsetsall_temp = subsetsall.copy()
subsetsall_temp.drop_duplicates(inplace=True)
print(subsetsall_temp.shape)
(44, 12)
(44, 12)
(44, 12)
(44, 12)
(44, 12)
(220, 12)
(219, 12)

We can see that no individual subset has any duplicated entries, but the concatenated dataframe does. With a precision of 4-6 decimal places over 11 columns, this is unlikely to be a true repeated measurement and is more likely a duplicate entry. We will keep the duplicate removed from the concatenated dataframe and also remove it from one of the subsets.

In [ ]:
duplicated_row = subsetsall[subsetsall.duplicated()]
duplicated_row
Out[ ]:
type w ar sp re rr sk ku hc rc sc kc
50 PO 0.015759 0.003448 0.4552 0.627426 0.056636 0.116363 0.678178 0.004218 0.107777 0.266969 0.444385
In [ ]:
# Find which subsets contain the duplicated row identified above
duplicated_row = subsetsall[subsetsall.duplicated()].iloc[0]

def is_same_row(row, target_row):
    return row.equals(target_row)

for i in range(5):
    # Check if each row in the DataFrame is the same as the example_row
    same_rows = subsets[i].apply(is_same_row, axis=1, args=(duplicated_row,))
    if same_rows.any():
        print("Subset " + str(i+1) + ":")
        print("Rows that are the same as the example row:")
        print(subsets[i][same_rows])
Subset 1:
Rows that are the same as the example row:
  type         w        ar      sp        re        rr        sk        ku  \
6   PO  0.015759  0.003448  0.4552  0.627426  0.056636  0.116363  0.678178   

         hc        rc        sc        kc  
6  0.004218  0.107777  0.266969  0.444385  
Subset 2:
Rows that are the same as the example row:
  type         w        ar      sp        re        rr        sk        ku  \
6   PO  0.015759  0.003448  0.4552  0.627426  0.056636  0.116363  0.678178   

         hc        rc        sc        kc  
6  0.004218  0.107777  0.266969  0.444385  

It can be seen from the above that subset 1 and subset 2 share an identical row at index 6. For the reasons mentioned earlier, we will remove this row from one of the subsets; we choose subset 2, since it appears later in the data than subset 1.

In [ ]:
subsets[1].head(10)
Out[ ]:
type w ar sp re rr sk ku hc rc sc kc
0 PO 0.008596 0.003448 0.6420 0.416456 1.003975 0.116834 0.961553 0.003549 0.125551 0.345271 0.407876
1 PO 0.012894 0.003448 0.3784 0.235612 0.599335 0.100720 0.661515 0.003367 0.100663 0.161510 0.362336
2 PO 0.012894 0.003448 0.3784 0.235612 0.472149 0.015586 0.373776 0.003766 0.070719 0.240516 0.371707
3 PO 0.010029 0.003448 0.2152 0.979030 0.421204 0.043400 0.783220 0.003151 0.004129 0.017908 0.028573
4 PO 0.020057 0.003448 0.5600 0.527511 0.050374 0.211741 0.725096 0.008584 0.084605 0.281740 0.329137
5 PO 0.011461 0.003448 0.2996 0.773502 0.310067 0.058798 0.536005 0.001847 0.116556 0.199172 0.441565
6 PO 0.015759 0.003448 0.4552 0.627426 0.056636 0.116363 0.678178 0.004218 0.107777 0.266969 0.444385
7 PO 0.035817 0.003448 0.4156 0.669620 0.007348 0.266460 0.738220 0.016878 0.039208 0.125001 0.242095
8 PO 0.011461 0.003448 0.3000 0.773502 0.121862 0.145167 0.534106 0.001259 0.123437 0.205165 0.502887
9 PO 0.053009 0.003448 0.4104 0.476751 0.081149 0.345283 1.202949 0.034153 0.087581 0.567489 0.475278
In [ ]:
subsets[1].drop(6, inplace=True)
print(subsets[1].shape)
(43, 12)
In [ ]:
subsets[1].head(10)
Out[ ]:
type w ar sp re rr sk ku hc rc sc kc
0 PO 0.008596 0.003448 0.6420 0.416456 1.003975 0.116834 0.961553 0.003549 0.125551 0.345271 0.407876
1 PO 0.012894 0.003448 0.3784 0.235612 0.599335 0.100720 0.661515 0.003367 0.100663 0.161510 0.362336
2 PO 0.012894 0.003448 0.3784 0.235612 0.472149 0.015586 0.373776 0.003766 0.070719 0.240516 0.371707
3 PO 0.010029 0.003448 0.2152 0.979030 0.421204 0.043400 0.783220 0.003151 0.004129 0.017908 0.028573
4 PO 0.020057 0.003448 0.5600 0.527511 0.050374 0.211741 0.725096 0.008584 0.084605 0.281740 0.329137
5 PO 0.011461 0.003448 0.2996 0.773502 0.310067 0.058798 0.536005 0.001847 0.116556 0.199172 0.441565
7 PO 0.035817 0.003448 0.4156 0.669620 0.007348 0.266460 0.738220 0.016878 0.039208 0.125001 0.242095
8 PO 0.011461 0.003448 0.3000 0.773502 0.121862 0.145167 0.534106 0.001259 0.123437 0.205165 0.502887
9 PO 0.053009 0.003448 0.4104 0.476751 0.081149 0.345283 1.202949 0.034153 0.087581 0.567489 0.475278
10 SL 0.018625 0.020690 0.6496 0.744641 0.304496 0.052827 0.370475 0.007743 0.187048 0.500302 0.330965

Reset index in the two dataframes with removed rows:

In [ ]:
subsetsall = subsetsall_temp.copy()

subsetsall.reset_index(drop=True, inplace=True) # use the drop=True to avoid the old index being added as a column, and having to drop it later
subsets[1].reset_index(drop=True, inplace=True)
subsets[1].head(10)
Out[ ]:
type w ar sp re rr sk ku hc rc sc kc
0 PO 0.008596 0.003448 0.6420 0.416456 1.003975 0.116834 0.961553 0.003549 0.125551 0.345271 0.407876
1 PO 0.012894 0.003448 0.3784 0.235612 0.599335 0.100720 0.661515 0.003367 0.100663 0.161510 0.362336
2 PO 0.012894 0.003448 0.3784 0.235612 0.472149 0.015586 0.373776 0.003766 0.070719 0.240516 0.371707
3 PO 0.010029 0.003448 0.2152 0.979030 0.421204 0.043400 0.783220 0.003151 0.004129 0.017908 0.028573
4 PO 0.020057 0.003448 0.5600 0.527511 0.050374 0.211741 0.725096 0.008584 0.084605 0.281740 0.329137
5 PO 0.011461 0.003448 0.2996 0.773502 0.310067 0.058798 0.536005 0.001847 0.116556 0.199172 0.441565
6 PO 0.035817 0.003448 0.4156 0.669620 0.007348 0.266460 0.738220 0.016878 0.039208 0.125001 0.242095
7 PO 0.011461 0.003448 0.3000 0.773502 0.121862 0.145167 0.534106 0.001259 0.123437 0.205165 0.502887
8 PO 0.053009 0.003448 0.4104 0.476751 0.081149 0.345283 1.202949 0.034153 0.087581 0.567489 0.475278
9 SL 0.018625 0.020690 0.6496 0.744641 0.304496 0.052827 0.370475 0.007743 0.187048 0.500302 0.330965

Use .describe() and .info() on our cleaned data (concatenated only is okay here):

In [ ]:
subsetsall.describe()
Out[ ]:
w ar sp re rr sk ku hc rc sc kc
count 219.000000 219.000000 219.000000 219.000000 219.000000 219.000000 219.000000 219.000000 219.000000 219.000000 219.000000
mean 0.176636 0.208510 0.599917 0.391446 0.249936 0.167753 0.604430 0.092718 0.216547 0.240620 0.478471
std 0.192585 0.256884 0.216652 0.265466 0.208867 0.164427 0.151247 0.138793 0.173341 0.143348 0.150952
min 0.001433 0.003448 0.025200 0.000591 0.000118 0.001608 0.168895 0.000013 0.004129 0.001474 0.028573
25% 0.037250 0.012357 0.415600 0.131266 0.087341 0.064804 0.519690 0.012782 0.108342 0.145656 0.371903
50% 0.078797 0.062834 0.587200 0.409072 0.214041 0.112573 0.570744 0.034153 0.158479 0.215147 0.482290
75% 0.277937 0.363793 0.826800 0.602742 0.341661 0.199904 0.669316 0.113712 0.254108 0.318871 0.572080
max 1.000000 1.037931 1.026800 1.007173 1.003975 1.002376 1.202949 1.049198 1.001281 1.000876 1.025173
In [ ]:
subsetsall.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 219 entries, 0 to 218
Data columns (total 12 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   type    219 non-null    object 
 1   w       219 non-null    float64
 2   ar      219 non-null    float64
 3   sp      219 non-null    float64
 4   re      219 non-null    float64
 5   rr      219 non-null    float64
 6   sk      219 non-null    float64
 7   ku      219 non-null    float64
 8   hc      219 non-null    float64
 9   rc      219 non-null    float64
 10  sc      219 non-null    float64
 11  kc      219 non-null    float64
dtypes: float64(11), object(1)
memory usage: 20.7+ KB

We can also check whether any numerical columns actually encode categorical data:

In [ ]:
subsetsall.nunique(axis=0) 
Out[ ]:
type      5
w       138
ar      164
sp      181
re      175
rr      219
sk      219
ku      219
hc      218
rc      219
sc      219
kc      219
dtype: int64

The only column with a low enough number of unique values to be considered categorical is the type column, which is already categorical, so we leave it as is.

We can create a list of numerical columns; a categorical list is not necessary for this dataset, as type is the only categorical column.

In [ ]:
nums = list(subsetsall.select_dtypes(exclude=['object']).columns)
nums
Out[ ]:
['w', 'ar', 'sp', 're', 'rr', 'sk', 'ku', 'hc', 'rc', 'sc', 'kc']

Visualize Data

PairPlot:

In [ ]:
sns.pairplot(subsetsall, vars=nums, hue='type')
Out[ ]:
<seaborn.axisgrid.PairGrid at 0x12f5fdaa6b0>

It can be seen that type PO spikes heavily in some cases, such as the hc and ar diagonal KDE plots, but is also pronounced elsewhere. We will return to this later.

Currently, we can identify some features that are useful for finding trends. From initial visual inspection, re and rc plotted against any other feature seem to split the data decently into their categorical types. Specifically, re vs rc seems the best individual pair.

However, other pairs are better at splitting only certain types.
For example, kc vs w, ar, rr, and hc splits PO and SL quite well, while ar vs rc splits LP and CR well.

A heatmap of the correlation between the numerical data can be made:

In [ ]:
sns.heatmap(subsetsall[nums].corr(), annot=True)
Out[ ]:
<Axes: >

re by far has some very good correlations with other features, namely w, ar, and sp; rc also has some strong ones.
Our visual inference was accurate.
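The visual reading of the heatmap can also be made quantitative by ranking feature pairs by absolute correlation. A minimal sketch, using a small synthetic dataframe standing in for subsetsall[nums] (the induced w/re correlation here is illustrative, not from the real data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for subsetsall[nums]; the real data comes from the Excel file
df = pd.DataFrame(rng.random((50, 4)), columns=['w', 'ar', 'sp', 're'])
df['re'] = 1 - df['w'] + 0.1 * rng.random(50)  # induce one strong correlation

corr = df.corr()
# Keep only the upper triangle, then rank feature pairs by absolute correlation
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().abs().sort_values(ascending=False)
print(pairs.head(3))
```

On the real dataset this would surface the re-w, re-ar, and re-sp pairs seen in the heatmap.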

Normalization will likely not help here, as the data is already near-normalized (max values near 1 and min values near 0):

In [ ]:
subsetsall.describe()
Out[ ]:
w ar sp re rr sk ku hc rc sc kc
count 219.000000 219.000000 219.000000 219.000000 219.000000 219.000000 219.000000 219.000000 219.000000 219.000000 219.000000
mean 0.176636 0.208510 0.599917 0.391446 0.249936 0.167753 0.604430 0.092718 0.216547 0.240620 0.478471
std 0.192585 0.256884 0.216652 0.265466 0.208867 0.164427 0.151247 0.138793 0.173341 0.143348 0.150952
min 0.001433 0.003448 0.025200 0.000591 0.000118 0.001608 0.168895 0.000013 0.004129 0.001474 0.028573
25% 0.037250 0.012357 0.415600 0.131266 0.087341 0.064804 0.519690 0.012782 0.108342 0.145656 0.371903
50% 0.078797 0.062834 0.587200 0.409072 0.214041 0.112573 0.570744 0.034153 0.158479 0.215147 0.482290
75% 0.277937 0.363793 0.826800 0.602742 0.341661 0.199904 0.669316 0.113712 0.254108 0.318871 0.572080
max 1.000000 1.037931 1.026800 1.007173 1.003975 1.002376 1.202949 1.049198 1.001281 1.000876 1.025173

We can try standardization to see if it helps:

In [ ]:
subsetsall_std = subsetsall.copy()
subsetsall_std.head()
Out[ ]:
type w ar sp re rr sk ku hc rc sc kc
0 PO 0.008596 0.006897 0.5748 0.838397 0.998562 0.091802 0.908459 0.003151 0.111302 0.256742 0.389952
1 PO 0.010029 0.003448 0.4112 0.838397 0.649317 0.039172 0.476520 0.002817 0.121299 0.332611 0.443785
2 PO 0.007163 0.003448 0.4400 1.007173 0.754309 0.048079 0.766430 0.002621 0.127759 0.323068 0.444515
3 PO 0.028653 0.003448 0.3124 0.534599 0.061617 0.244800 0.789110 0.010007 0.092632 0.220312 0.339685
4 PO 0.018625 0.003448 0.4024 0.557089 0.037346 0.578774 0.630554 0.006757 0.073914 0.270908 0.273045
In [ ]:
suffix = '_std' # use 'suffix' rather than 'sc' to avoid clashing with the sc column name
nums_std = [s + suffix for s in nums]
print(nums_std)
['w_std', 'ar_std', 'sp_std', 're_std', 'rr_std', 'sk_std', 'ku_std', 'hc_std', 'rc_std', 'sc_std', 'kc_std']
In [ ]:
std_Scaler = preprocessing.StandardScaler()

std_Scaler.fit(subsetsall_std[nums])

subsetsall_std[nums_std] = std_Scaler.transform(subsetsall_std[nums])

subsetsall_std.drop(nums, axis=1, inplace=True)
subsetsall_std.head()
Out[ ]:
type w_std ar_std sp_std re_std rr_std sk_std ku_std hc_std rc_std sc_std kc_std
0 PO -0.874553 -0.786637 -0.116198 1.687500 3.592437 -0.462970 2.014749 -0.646807 -0.608549 0.112723 -0.587744
1 PO -0.867095 -0.800094 -0.873054 1.687500 1.916513 -0.783785 -0.847640 -0.649219 -0.550744 0.643202 -0.230305
2 PO -0.882011 -0.800094 -0.739818 2.324728 2.420339 -0.729491 1.073546 -0.650634 -0.513391 0.576477 -0.225458
3 PO -0.770168 -0.800094 -1.330129 0.540485 -0.903687 0.469655 1.223842 -0.597296 -0.716502 -0.141997 -0.921506
4 PO -0.822358 -0.800094 -0.913765 0.625398 -1.020156 2.505449 0.173118 -0.620766 -0.824734 0.211772 -1.363982
In [ ]:
subsetsall_std.describe()
Out[ ]:
w_std ar_std sp_std re_std rr_std sk_std ku_std hc_std rc_std sc_std kc_std
count 2.190000e+02 2.190000e+02 2.190000e+02 2.190000e+02 2.190000e+02 2.190000e+02 2.190000e+02 2.190000e+02 2.190000e+02 2.190000e+02 2.190000e+02
mean -8.111218e-18 -3.244487e-17 1.662800e-16 -9.733462e-17 -6.083414e-18 -7.502877e-17 1.022014e-15 3.244487e-17 -4.461170e-17 9.125121e-17 -4.866731e-17
std 1.002291e+00 1.002291e+00 1.002291e+00 1.002291e+00 1.002291e+00 1.002291e+00 1.002291e+00 1.002291e+00 1.002291e+00 1.002291e+00 1.002291e+00
min -9.118320e-01 -8.000944e-01 -2.658791e+00 -1.475707e+00 -1.198802e+00 -1.012763e+00 -2.886221e+00 -6.694647e-01 -1.228245e+00 -1.672119e+00 -2.987224e+00
25% -7.254280e-01 -7.653339e-01 -8.526988e-01 -9.823347e-01 -7.802472e-01 -6.275409e-01 -5.615597e-01 -5.772531e-01 -6.256668e-01 -6.639940e-01 -7.075824e-01
50% -5.091975e-01 -5.683864e-01 -5.883228e-02 6.654728e-02 -1.722487e-01 -3.363570e-01 -2.232333e-01 -4.229268e-01 -3.357616e-01 -1.781106e-01 2.536041e-02
75% 5.272102e-01 6.058740e-01 1.049620e+00 7.977667e-01 4.401597e-01 1.959837e-01 4.299905e-01 1.516094e-01 2.171876e-01 5.471275e-01 6.215435e-01
max 4.285127e+00 3.236178e+00 1.974872e+00 2.324728e+00 3.618412e+00 5.087585e+00 3.966286e+00 6.907185e+00 4.537491e+00 5.315733e+00 3.629988e+00
In [ ]:
sns.pairplot(subsetsall_std, vars=nums_std, hue='type')
Out[ ]:
<seaborn.axisgrid.PairGrid at 0x12f5fd8f220>

The data still looks very similar. We will not use the standardized data and will instead stick with the original, near-normalized data.

Let's come back to the PO spike from our pairplots and check whether it is simply due to a large number of PO samples in the data:

In [ ]:
sns.countplot(x='type', data=subsetsall)
Out[ ]:
<Axes: xlabel='type', ylabel='count'>

There is not an overwhelming number of PO samples in the data, so the relationship must be genuinely strong, as represented by the kernel density estimate (KDE) plots on the pairplot diagonal.

We can investigate it further with a boxplot.

In [ ]:
sns.boxplot(x='type', y='hc', data=subsetsall)
Out[ ]:
<Axes: xlabel='type', ylabel='hc'>

A low hc very likely means the weld defect is PO. This is a strong relation we can use later, but we must not overlook other features, as other types also have some low hc values.
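The boxplot impression can be checked numerically, for instance with a per-type median of hc. A toy sketch with made-up values standing in for the real dataset:

```python
import pandas as pd

# Toy stand-in values; the real hc values come from the weld defect dataset
df = pd.DataFrame({'type': ['PO'] * 4 + ['CR'] * 4,
                   'hc':   [0.003, 0.004, 0.010, 0.007, 0.10, 0.19, 0.15, 0.27]})

# Median hc per type quantifies the impression from the boxplot
medians = df.groupby('type')['hc'].median()
print(medians)
```

Running the same groupby on subsetsall would show how far below the other types the PO median sits.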

More plots of the data can be made to try to spot anything not yet clear if questions arise in later analysis.

Additional Steps:

One thing we might want to do for later is binarize the type column so we can use it in our models:

In [ ]:
types = subsetsall['type'].unique()
types = list(types)
types
Out[ ]:
['PO', 'SL', 'LP', 'LF', 'CR']
In [ ]:
from sklearn.preprocessing import label_binarize

type_num = label_binarize(subsetsall.type, classes=types)
print(type_num)
[[1 0 0 0 0]
 [1 0 0 0 0]
 [1 0 0 0 0]
 ...
 [0 0 0 0 1]
 [0 0 0 0 1]
 [0 0 0 0 1]]
In [ ]:
for i in range(len(types)):
    subsetsall[types[i]] = type_num[:,i]

subsetsall
Out[ ]:
type w ar sp re rr sk ku hc rc sc kc PO SL LP LF CR
0 PO 0.008596 0.006897 0.5748 0.838397 0.998562 0.091802 0.908459 0.003151 0.111302 0.256742 0.389952 1 0 0 0 0
1 PO 0.010029 0.003448 0.4112 0.838397 0.649317 0.039172 0.476520 0.002817 0.121299 0.332611 0.443785 1 0 0 0 0
2 PO 0.007163 0.003448 0.4400 1.007173 0.754309 0.048079 0.766430 0.002621 0.127759 0.323068 0.444515 1 0 0 0 0
3 PO 0.028653 0.003448 0.3124 0.534599 0.061617 0.244800 0.789110 0.010007 0.092632 0.220312 0.339685 1 0 0 0 0
4 PO 0.018625 0.003448 0.4024 0.557089 0.037346 0.578774 0.630554 0.006757 0.073914 0.270908 0.273045 1 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
214 CR 0.277937 0.949262 1.0268 0.102869 0.723013 0.025025 0.468658 0.101296 0.757683 0.231426 0.516244 0 0 0 0 1
215 CR 0.148997 0.720690 0.8172 0.055527 0.509504 0.135456 0.551284 0.010890 0.262126 0.410800 0.530843 0 0 0 0 1
216 CR 0.320917 0.846359 0.7100 0.106793 0.407912 0.027538 0.488077 0.191586 0.757547 0.158517 0.559012 0 0 0 0 1
217 CR 0.322350 0.578386 0.6420 0.143629 0.384393 0.039732 0.492730 0.154902 0.640716 0.218541 0.567931 0 0 0 0 1
218 CR 0.372493 0.799686 0.8580 0.167046 0.235256 0.075930 0.558360 0.268964 0.637409 0.164191 0.586349 0 0 0 0 1

219 rows × 17 columns

In [ ]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder() # Create label encoder

subsetsall['type_num'] = le.fit_transform(subsetsall['type']) # encode type column as integers

subsetsall.head(40)
Out[ ]:
type w ar sp re rr sk ku hc rc sc kc PO SL LP LF CR type_num
0 PO 0.008596 0.006897 0.5748 0.838397 0.998562 0.091802 0.908459 0.003151 0.111302 0.256742 0.389952 1 0 0 0 0 3
1 PO 0.010029 0.003448 0.4112 0.838397 0.649317 0.039172 0.476520 0.002817 0.121299 0.332611 0.443785 1 0 0 0 0 3
2 PO 0.007163 0.003448 0.4400 1.007173 0.754309 0.048079 0.766430 0.002621 0.127759 0.323068 0.444515 1 0 0 0 0 3
3 PO 0.028653 0.003448 0.3124 0.534599 0.061617 0.244800 0.789110 0.010007 0.092632 0.220312 0.339685 1 0 0 0 0 3
4 PO 0.018625 0.003448 0.4024 0.557089 0.037346 0.578774 0.630554 0.006757 0.073914 0.270908 0.273045 1 0 0 0 0 3
5 PO 0.011461 0.003448 0.2996 0.773502 0.133474 0.243676 0.452340 0.001358 0.090320 0.259598 0.482290 1 0 0 0 0 3
6 PO 0.015759 0.003448 0.4552 0.627426 0.056636 0.116363 0.678178 0.004218 0.107777 0.266969 0.444385 1 0 0 0 0 3
7 PO 0.027221 0.003448 0.4156 0.557089 0.101013 0.093192 0.939251 0.008386 0.070805 0.002616 0.368677 1 0 0 0 0 3
8 PO 0.030086 0.003448 0.4248 0.513840 0.001971 0.358502 0.653443 0.014692 0.032748 0.172884 0.287875 1 0 0 0 0 3
9 PO 0.035817 0.003448 0.4156 0.669620 0.004285 0.681613 0.451320 0.017031 0.038732 0.211128 0.444937 1 0 0 0 0 3
10 SL 0.050143 0.034648 0.5228 0.516920 0.177013 0.063348 0.494043 0.019171 0.197270 0.301136 0.565477 0 1 0 0 0 4
11 SL 0.025788 0.015617 0.3340 0.471477 0.276169 0.063095 0.507786 0.005105 0.141924 0.245195 0.542960 0 1 0 0 0 4
12 SL 0.063037 0.081610 0.4756 0.551477 0.400952 0.085216 0.625463 0.026221 0.226071 0.105925 0.592413 0 1 0 0 0 4
13 SL 0.067335 0.058621 0.5132 0.611181 0.390934 0.082072 0.670788 0.028746 0.174749 0.118310 0.519074 0 1 0 0 0 4
14 SL 0.042980 0.021438 0.5288 0.766076 0.346107 0.140695 0.653494 0.017571 0.150256 0.382587 0.420232 0 1 0 0 0 4
15 SL 0.138968 0.115517 0.5192 0.513840 0.053111 0.232464 0.481685 0.214936 0.302905 0.081632 0.646642 0 1 0 0 0 4
16 SL 0.147564 0.023731 0.2656 0.884430 0.039914 0.596128 0.654200 0.109447 0.097271 0.268753 0.100178 0 1 0 0 0 4
17 SL 0.074499 0.012645 0.1924 0.631139 0.099007 0.292338 0.753337 0.041720 0.102973 0.206860 0.436098 0 1 0 0 0 4
18 SL 0.074499 0.044562 0.4872 0.599283 0.089800 0.320126 0.538374 0.027256 0.125914 0.240262 0.627607 0 1 0 0 0 4
19 SL 0.065903 0.061524 0.6596 0.513840 0.161838 0.643830 0.269329 0.034770 0.158479 0.210568 0.911416 0 1 0 0 0 4
20 LP 0.415473 0.431379 0.8668 0.068861 0.125231 0.168406 0.767691 0.088235 0.193620 0.131252 0.242035 0 0 1 0 0 2
21 LP 0.613181 0.324466 0.7932 0.253713 0.030618 0.114958 0.715633 0.483561 0.106063 0.041874 0.302604 0 0 1 0 0 2
22 LP 0.253582 0.417241 0.8664 0.110422 0.329608 0.069216 0.537527 0.060204 0.208140 0.010399 0.480510 0 0 1 0 0 2
23 LP 0.187679 0.359769 0.8320 0.037932 0.318771 0.116448 1.113649 0.017783 0.156013 0.068120 0.511571 0 0 1 0 0 2
24 LP 0.550143 0.242717 0.8284 0.087806 0.068943 0.479354 0.693576 0.411346 0.105377 0.289325 0.250541 0 0 1 0 0 2
25 LP 1.000000 0.642338 0.8376 0.112152 0.134046 0.106603 0.670257 0.617477 0.132390 0.144749 0.343697 0 0 1 0 0 2
26 LP 0.777937 0.493869 0.8384 0.046878 0.144770 0.091868 0.657572 0.266960 0.134514 0.331065 0.354818 0 0 1 0 0 2
27 LP 0.246418 0.308045 0.8300 0.004051 0.088274 0.344919 0.551920 0.050698 0.145350 0.306394 0.423902 0 0 1 0 0 2
28 LP 0.339542 0.289921 0.8372 0.102152 0.053839 0.221241 0.570290 0.172724 0.166593 0.162887 0.479350 0 0 1 0 0 2
29 LP 0.343840 0.391379 0.8336 0.015190 0.085972 0.160149 0.595178 0.117161 0.181476 0.339861 0.552816 0 0 1 0 0 2
30 LF 0.402579 0.221838 0.6692 0.209451 0.103607 0.246070 0.577897 0.417545 0.247450 0.204585 0.438135 0 0 0 1 0 1
31 LF 0.063037 0.062834 0.2300 0.304515 0.299782 0.203618 0.640899 0.019800 0.181293 0.189060 0.498422 0 0 0 1 0 1
32 LF 0.206304 0.111686 0.7932 0.353038 0.130069 0.098141 0.547505 0.227999 0.333604 0.163009 0.631823 0 0 0 1 0 1
33 LF 0.071633 0.055172 0.5780 0.409072 0.196221 0.068892 0.592551 0.029966 0.215632 0.344792 0.457783 0 0 0 1 0 1
34 LF 0.075931 0.027790 0.4628 0.571181 0.333461 0.044079 0.570744 0.052599 0.161889 0.131012 0.480531 0 0 0 1 0 1
35 LF 0.073066 0.022607 0.3972 0.601519 0.224584 0.211671 0.751020 0.061937 0.114510 0.136843 0.454621 0 0 0 1 0 1
36 LF 0.110315 0.054652 0.3364 0.513038 0.258901 0.038028 0.499730 0.071062 0.155450 0.101629 0.489576 0 0 0 1 0 1
37 CR 0.100287 0.400000 0.7424 0.152025 0.516416 0.137537 0.555328 0.022585 0.432063 0.414954 0.528984 0 0 0 0 1 0
38 CR 0.209169 0.547510 0.8400 0.127131 0.405834 0.074451 0.543629 0.034410 0.347480 0.284818 0.439879 0 0 0 0 1 0
39 CR 0.277937 0.826724 0.8816 0.085738 0.704132 0.011205 0.486398 0.104069 0.731836 0.149571 0.529300 0 0 0 0 1 0

Classification

As explored in data preprocessing and visualization, we will use the re and rc features to classify the data.

Let's review this individual scatterplot:

In [ ]:
sns.scatterplot(data=subsetsall, x='re', y='rc', hue='type')

plt.show()

Upon inspection of this plot, it is best to use the following three types as our targets: CR, LP, and SL. (CR and LP were noted earlier in data preprocessing and visualization; we add SL as well due to what we can see more closely in this close-up of the plot.)

More inferences can also be made:

  • It will be easy to draw a decision boundary between LP and SL, but harder for a KNN approach here, as the spread along the re axis is larger for SL than for LP.

  • For the split between CR and LP, the opposite seems true: a KNN approach might produce better results, given the tightly clumped CR points near the fuzzy border and the decently clumped LP points slightly farther from it.

Before we move on to classification, we do a quick cleanup/restructure of the data so it contains only the features and targets stated above. (We leave the data in its subsets so they can serve as train/test splits, and we encode the type column for later use.)

In [ ]:

for i in range(5):
    subsets[i] = subsets[i][['type','re', 'rc']] # only keep the columns we want (by features + target column)
    subsets[i] = subsets[i][subsets[i]['type'].isin(['CR','LP','SL'])] # only keep the rows we want (by type)
    subsets[i].reset_index(drop=True, inplace=True) # reset index
    subsets[i]['type_num'] = le.fit_transform(subsets[i]['type']) # encode type column
    print(subsets[i].shape)
(27, 4)
(27, 4)
(27, 4)
(27, 4)
(27, 4)

The count of these types is the same in each subset, which is convenient for cross-validation later.
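Because the five subsets are balanced, a leave-one-subset-out scheme is a natural cross-validation: hold each subset out as the test fold in turn and train on the other four. A sketch with synthetic stand-in subsets (the real ones hold the re and rc columns and encoded labels):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
# Synthetic stand-ins for the five 27-row subsets of (re, rc) features and labels
subsets_X = [rng.random((27, 2)) for _ in range(5)]
subsets_y = [rng.integers(0, 3, 27) for _ in range(5)]

# Leave-one-subset-out: each subset serves once as the held-out test fold
scores = []
for i in range(5):
    X_tr = np.concatenate([subsets_X[j] for j in range(5) if j != i])
    y_tr = np.concatenate([subsets_y[j] for j in range(5) if j != i])
    model = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
    scores.append(model.score(subsets_X[i], subsets_y[i]))
print(np.round(scores, 3))
```

On the random stand-in data the scores hover near chance; on the real subsets the five scores indicate how stable the classifier is across folds.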

One of the new subsets:

In [ ]:
subsets[4]
Out[ ]:
type re rc type_num
0 SL 0.725865 0.218229 2
1 SL 0.638143 0.212740 2
2 SL 0.504768 0.146195 2
3 SL 0.316709 0.188856 2
4 SL 0.870253 0.217319 2
5 SL 0.730675 0.095278 2
6 SL 0.606329 0.179129 2
7 SL 0.622489 0.101563 2
8 SL 0.771772 0.092916 2
9 SL 0.585232 0.160975 2
10 LP 0.091646 0.182427 1
11 LP 0.162194 0.074074 1
12 LP 0.115063 0.233013 1
13 LP 0.153840 0.207985 1
14 LP 0.185274 0.132899 1
15 LP 0.003038 0.092287 1
16 LP 0.100253 0.130390 1
17 LP 0.002152 0.204965 1
18 LP 0.042278 0.178519 1
19 LP 0.006034 0.292241 1
20 CR 0.142574 0.308130 0
21 CR 0.071941 0.398472 0
22 CR 0.102869 0.757683 0
23 CR 0.055527 0.262126 0
24 CR 0.106793 0.757547 0
25 CR 0.143629 0.640716 0
26 CR 0.167046 0.637409 0

KNN

Now we can make our KNN model.

Create training and testing data from the subsets:

In [ ]:
X = []
y = []
for i in range(5):
    X.append(subsets[i][['re', 'rc']].values)
    y.append(subsets[i]['type_num'].values)
print(X[4].shape)
print(y[4].shape)

# Training Data - subsets 1-4 (80% of data)
X_train = np.concatenate(X[:4], axis=0)
y_train = np.concatenate(y[:4], axis=0)
print(X_train.shape)
print(y_train.shape)

# Testing Data - subset 5 (20% of data)
X_test = X[4]
y_test = y[4]
print(X_test.shape)
print(y_test.shape)
(27, 2)
(27,)
(108, 2)
(108,)
(27, 2)
(27,)

A jointplot for more visualization before we begin:

In [ ]:
sns.jointplot(x='re', y='rc', data=subsets[4], hue='type_num', kind='scatter')
Out[ ]:
<seaborn.axisgrid.JointGrid at 0x12f78675390>

Fit KNN model:

In [ ]:
from sklearn.neighbors import KNeighborsClassifier

k = 5
knn_model = KNeighborsClassifier(n_neighbors=k)

knn_model.fit(X_train, y_train)
Out[ ]:
KNeighborsClassifier()

Plot the decision boundary of the model:

In [ ]:
zoom_parameter = 0.2

#---min and max for the first feature---
x_min, x_max = X_train[:, 0].min() - zoom_parameter, X_train[:, 0].max() + zoom_parameter

#---min and max for the second feature---
y_min, y_max = X_train[:, 1].min() - zoom_parameter, X_train[:, 1].max() + zoom_parameter

#---step size in the mesh---
x_step = (x_max - x_min) / 100
y_step = (y_max - y_min) / 100

#---make predictions for each of the points in xx,yy---
xx, yy = np.meshgrid(np.arange(x_min, x_max, x_step), np.arange(y_min, y_max, y_step))

Z = knn_model.predict(np.c_[xx.ravel(), yy.ravel()])

#---draw the result using a color plot---
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Accent, alpha=0.8)

#---plot the training points---
colors = ['red', 'green', 'blue']
types = ['CR', 'LP', 'SL']
for color, i, target in zip(colors, [0, 1, 2], types):
    plt.scatter(X_train[y_train==i, 0], X_train[y_train==i, 1], color=color, label=target)

plt.xlabel('Roughness of Defect Edge (re)')
plt.ylabel('Roughness Contrast (rc)')
plt.title(f'Decision Surface for KNN model with (k={k})')
plt.legend(loc='best', shadow=False, scatterpoints=1)
Out[ ]:
<matplotlib.legend.Legend at 0x12f785ffa90>

It seems like there may be some slight overfitting here, but it is not too bad. We will see how it performs.

Predict weld defect types using testing data:

In [ ]:
y_pred = knn_model.predict(X_test)
print(y_pred)
[2 2 2 1 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0]
In [ ]:
from sklearn.metrics import confusion_matrix

mat_test = confusion_matrix(y_test, y_pred)
print('confusion matrix = \n', mat_test)
confusion matrix = 
 [[7 0 0]
 [1 9 0]
 [0 1 9]]
In [ ]:
fig, ax = plt.subplots(1, 1, figsize=(4, 4))
cm = confusion_matrix(y_test, y_pred)
ax = sns.heatmap(cm, annot=True, square=True, xticklabels=types, yticklabels=types)
ax.set_xlabel('Predicted Labels')
ax.set_ylabel('Actual Labels')
Out[ ]:
Text(17.25, 0.5, 'Actual Labels')

Some precursors to confusion matrix calculations:

In [ ]:
# True Positive (TP) = diagonal elements
CR_TP = mat_test[0,0]
LP_TP = mat_test[1,1]
SL_TP = mat_test[2,2]
print(CR_TP, LP_TP, SL_TP) 

# False Negative (FN) = sum of row - TP
CR_FN = sum(mat_test[0])-CR_TP
LP_FN = sum(mat_test[1])-LP_TP
SL_FN = sum(mat_test[2])-SL_TP
print(CR_FN, LP_FN, SL_FN)

# False Positive (FP) = sum of column - TP
CR_FP = sum(mat_test[:,0])-CR_TP
LP_FP = sum(mat_test[:,1])-LP_TP
SL_FP = sum(mat_test[:,2])-SL_TP
print(CR_FP, LP_FP, SL_FP)
7 9 9
0 1 1
1 1 0

The True Positive Rate (or Recall or Sensitivity) can be calculated using the formula:

TPR = TP / (TP + FN)

In [ ]:
CR_TPR = CR_TP/(CR_TP+CR_FN)
LP_TPR = LP_TP/(LP_TP+LP_FN)
SL_TPR = SL_TP/(SL_TP+SL_FN)
print(CR_TPR, LP_TPR, SL_TPR)
1.0 0.9 0.9

The Positive Predictive Rate (or Precision) can be calculated using the formula:

PPR = TP / (TP + FP)

In [ ]:
CR_PPR = CR_TP/(CR_TP+CR_FP)
LP_PPR = LP_TP/(LP_TP+LP_FP)
SL_PPR = SL_TP/(SL_TP+SL_FP)
print(CR_PPR, LP_PPR, SL_PPR)
0.875 0.9 1.0

Final accuracy can be calculated using:

Accuracy = (CR_TP + LP_TP + SL_TP) / (total number of samples)

In [ ]:
PPR = (CR_TP + LP_TP + SL_TP)/sum(sum(mat_test))
print(PPR)
0.9259259259259259

Let's verify our manual calculations with the classification_report function:

In [ ]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred, target_names=types))
              precision    recall  f1-score   support

          CR       0.88      1.00      0.93         7
          LP       0.90      0.90      0.90        10
          SL       1.00      0.90      0.95        10

    accuracy                           0.93        27
   macro avg       0.92      0.93      0.93        27
weighted avg       0.93      0.93      0.93        27

A ~93% accurate model is not bad here. CR is predicted with 100% recall, most likely because of the dense clump of its datapoints near its border with LP. One actual LP point and one actual SL point are misclassified: the LP point falls in the CR region, and the SL point falls in the LP region. This was expected from the earlier re vs. rc scatterplot; the accuracy, however, turned out higher than expected.

Finding best K value¶

Our model works well but it is always good to check if we can improve it. We can use a for loop to find the best K value for our model:

In [ ]:
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score

ac_scores = []
k_neighbors = list(range(1,21))
k_neighbors = [k for k in k_neighbors if k % 3 != 0] # remove multiples of 3 to avoid ties

for k in k_neighbors: 
    knn_model = KNeighborsClassifier(n_neighbors=k)
    knn_model.fit(X_train, y_train)
    y_pred = knn_model.predict(X_test)
    f1 = f1_score(y_test, y_pred, average='weighted')
    print(f'k={k}: {f1*100:0.2f}%')
    # print(classification_report(y_test, y_pred, target_names=types))

    ac_score = accuracy_score(y_test, y_pred)
    ac_scores.append(ac_score)
k=1: 92.59%
k=2: 89.06%
k=4: 92.62%
k=5: 92.62%
k=7: 92.69%
k=8: 89.06%
k=10: 92.69%
k=11: 92.69%
k=13: 92.69%
k=14: 92.69%
k=16: 92.69%
k=17: 92.69%
k=19: 92.69%
k=20: 92.69%

It seems that our original, default k value of 5 was already good for our model. There is, however, a slightly higher score at k=7 and again at k=10 and above. That slight gain may not justify the extra computational cost in other settings, but it does not affect our research usage here, so we could raise k to 10 or so if we were to predict further or perform k-fold cross-validation. Higher k values also reduce the risk of overfitting.

Misclassification Error¶

Looking into and plotting the Misclassification Error (MSE):

In [ ]:
# changing to misclassification error:
MSE = [1 - x for x in ac_scores]

# determining best k:
optimal_k = k_neighbors[MSE.index(min(MSE))]
print("The optimal number of neighbors is %d" % optimal_k)
The optimal number of neighbors is 1
In [ ]:
 # plot misclassification error vs k:
plt.plot(k_neighbors, MSE)
plt.xlabel('Number of Neighbors K')
plt.ylabel('Misclassification Error')
plt.show()

We can see that the error is lowest at 1 neighbor, but a 1-NN model would be heavily overfit. Our original k value of 5 sits between two error peaks, so it may not be the best choice. As noted before we even looked at the misclassification error, we should go for a higher k; 11 or 13 would be a comfortable choice.
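One way to encode that "prefer the largest k within tolerance of the best score" rule is a small helper. A minimal sketch, using hypothetical accuracy values (illustrative stand-ins, not this notebook's actual scores):

```python
# Hypothetical candidate k values and accuracies (illustrative stand-ins,
# not the scores produced by this notebook's run):
ks = [1, 2, 4, 5, 7, 8, 10, 11, 13]
acc = [0.926, 0.891, 0.926, 0.926, 0.927, 0.891, 0.927, 0.927, 0.927]

# Among k values whose accuracy is within `tol` of the best, prefer the
# largest k: it smooths the decision boundary and lowers overfitting risk.
tol = 0.005
best = max(acc)
robust_k = max(k for k, a in zip(ks, acc) if a >= best - tol)
print(robust_k)  # -> 13
```

This picks the largest k that is statistically indistinguishable (within the chosen tolerance) from the best score, rather than blindly taking the argmax.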

K-Fold Cross Validation Using Entire Dataset¶

Instead of just using 2 features and 3 weld defect types, we can use the entire dataset with all features and all 5 weld defect types to see how this model will do. We will do this across multiple folds as well.
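As a quick, self-contained illustration of cross_val_score before applying it to the weld data (using scikit-learn's bundled iris dataset as a stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: scikit-learn's bundled iris set (not the weld dataset)
X_demo, y_demo = load_iris(return_X_y=True)

# 10-fold cross-validation; folds are stratified by class for classifiers
scores = cross_val_score(KNeighborsClassifier(n_neighbors=8),
                         X_demo, y_demo, cv=10, scoring='accuracy')
print(f'mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})')
```

Each of the 10 folds serves once as the held-out set, so the mean score is a less split-dependent estimate than any single train/test split.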

In [ ]:
subsetsall # our combination of subsets as one large dataset (with encoded type columns and the duplicate row removed, as discussed earlier)
Out[ ]:
type w ar sp re rr sk ku hc rc sc kc PO SL LP LF CR type_num
0 PO 0.008596 0.006897 0.5748 0.838397 0.998562 0.091802 0.908459 0.003151 0.111302 0.256742 0.389952 1 0 0 0 0 3
1 PO 0.010029 0.003448 0.4112 0.838397 0.649317 0.039172 0.476520 0.002817 0.121299 0.332611 0.443785 1 0 0 0 0 3
2 PO 0.007163 0.003448 0.4400 1.007173 0.754309 0.048079 0.766430 0.002621 0.127759 0.323068 0.444515 1 0 0 0 0 3
3 PO 0.028653 0.003448 0.3124 0.534599 0.061617 0.244800 0.789110 0.010007 0.092632 0.220312 0.339685 1 0 0 0 0 3
4 PO 0.018625 0.003448 0.4024 0.557089 0.037346 0.578774 0.630554 0.006757 0.073914 0.270908 0.273045 1 0 0 0 0 3
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
214 CR 0.277937 0.949262 1.0268 0.102869 0.723013 0.025025 0.468658 0.101296 0.757683 0.231426 0.516244 0 0 0 0 1 0
215 CR 0.148997 0.720690 0.8172 0.055527 0.509504 0.135456 0.551284 0.010890 0.262126 0.410800 0.530843 0 0 0 0 1 0
216 CR 0.320917 0.846359 0.7100 0.106793 0.407912 0.027538 0.488077 0.191586 0.757547 0.158517 0.559012 0 0 0 0 1 0
217 CR 0.322350 0.578386 0.6420 0.143629 0.384393 0.039732 0.492730 0.154902 0.640716 0.218541 0.567931 0 0 0 0 1 0
218 CR 0.372493 0.799686 0.8580 0.167046 0.235256 0.075930 0.558360 0.268964 0.637409 0.164191 0.586349 0 0 0 0 1 0

219 rows × 18 columns

In [ ]:
X, y = subsetsall[nums].values, subsetsall['type'].values
print(X.shape)
print(y.shape)
(219, 11)
(219,)
In [ ]:
from sklearn.model_selection import cross_val_score

#---holds the cross-validation scores---
cv_scores = []

#---number of folds---
folds = 10

#---candidate k values, up to the size of a training fold---
ks = list(range(1, int(len(X) * ((folds - 1)/folds))))

# ---remove all multiples of 5 as this is a 5 class problem and we want to avoid ties---
ks = [k for k in ks if k % 5 != 0]

#---perform k-fold cross validation---
for k in ks:
    knn = KNeighborsClassifier(n_neighbors=k)
    #---performs cross-validation and returns the average accuracy---
    scores = cross_val_score(knn, X, y, cv=folds, scoring='accuracy')
    mean = scores.mean()
    cv_scores.append(mean)
    print(k, mean)
1 0.8123376623376621
2 0.79004329004329
3 0.7898268398268399
4 0.8175324675324676
6 0.8220779220779221
7 0.8220779220779221
8 0.8359307359307359
9 0.8177489177489177
11 0.7902597402597402
12 0.8043290043290042
13 0.7993506493506494
14 0.7904761904761906
16 0.7902597402597402
17 0.7764069264069263
18 0.7809523809523811
19 0.7945887445887446
21 0.7673160173160173
22 0.7673160173160174
23 0.7627705627705628
24 0.7718614718614718
26 0.7627705627705628
27 0.7675324675324675
28 0.7629870129870129
29 0.7766233766233765
31 0.7629870129870129
32 0.772077922077922
33 0.7675324675324674
34 0.7448051948051948
36 0.7629870129870129
37 0.7445887445887445
38 0.7536796536796537
39 0.7627705627705628
41 0.7627705627705628
42 0.7536796536796536
43 0.7582251082251082
44 0.7584415584415585
46 0.7584415584415585
47 0.762987012987013
48 0.7443722943722944
49 0.758008658008658
51 0.7625541125541125
52 0.758008658008658
53 0.7534632034632034
54 0.758008658008658
56 0.7489177489177489
57 0.7352813852813853
58 0.7352813852813853
59 0.7214285714285714
61 0.7032467532467532
62 0.6941558441558442
63 0.7032467532467533
64 0.7077922077922079
66 0.6941558441558441
67 0.675974025974026
68 0.675974025974026
69 0.666883116883117
71 0.6712121212121211
72 0.6530303030303031
73 0.6393939393939394
74 0.6300865800865803
76 0.6209956709956709
77 0.6164502164502164
78 0.607142857142857
79 0.5935064935064934
81 0.5935064935064934
82 0.598051948051948
83 0.5889610389610389
84 0.5889610389610389
86 0.5798701298701298
87 0.5932900432900432
88 0.5841991341991342
89 0.5839826839826839
91 0.5703463203463203
92 0.5748917748917749
93 0.5658008658008657
94 0.5567099567099567
96 0.566017316017316
97 0.5478354978354978
98 0.5432900432900433
99 0.5387445887445887
101 0.5432900432900433
102 0.5387445887445887
103 0.5432900432900433
104 0.5341991341991341
106 0.5203463203463203
107 0.5294372294372294
108 0.533982683982684
109 0.5158008658008658
111 0.5203463203463203
112 0.5203463203463203
113 0.5203463203463203
114 0.5203463203463203
116 0.5021645021645021
117 0.4976190476190476
118 0.4976190476190476
119 0.4976190476190476
121 0.48398268398268396
122 0.48852813852813853
123 0.4930735930735931
124 0.4930735930735931
126 0.48852813852813853
127 0.48398268398268396
128 0.47943722943722944
129 0.48398268398268396
131 0.48398268398268396
132 0.48852813852813853
133 0.4930735930735931
134 0.47943722943722944
136 0.47943722943722944
137 0.47943722943722944
138 0.48852813852813853
139 0.47943722943722944
141 0.4930735930735931
142 0.48852813852813853
143 0.474891774891775
144 0.474891774891775
146 0.47489177489177486
147 0.47489177489177486
148 0.474891774891775
149 0.474891774891775
151 0.4703463203463204
152 0.46580086580086577
153 0.4703463203463204
154 0.474891774891775
156 0.48398268398268396
157 0.4930735930735931
158 0.5112554112554113
159 0.5021645021645021
161 0.49783549783549785
162 0.4932900432900433
163 0.49350649350649345
164 0.507142857142857
166 0.498051948051948
167 0.4932900432900433
168 0.4796536796536796
169 0.47510822510822515
171 0.48441558441558435
172 0.4796536796536796
173 0.4796536796536796
174 0.4705627705627705
176 0.4705627705627705
177 0.46580086580086577
178 0.461038961038961
179 0.45649350649350656
181 0.45649350649350656
182 0.45649350649350656
183 0.45649350649350656
184 0.46580086580086577
186 0.451948051948052
187 0.4428571428571429
188 0.4292207792207792
189 0.4337662337662337
191 0.42467532467532465
192 0.42012987012987013
193 0.41558441558441556
194 0.4012987012987013
196 0.32835497835497834
In [ ]:
#---calculate misclassification error for each k---
MSE = [1 - x for x in cv_scores]

#---determining best k (min. MSE)---
optimal_k = ks[MSE.index(min(MSE))]
print(f"The optimal number of neighbors is {optimal_k}")

#---plot misclassification error vs k---
plt.plot(ks, MSE)
plt.plot(optimal_k, MSE[ks.index(optimal_k)], 'r', marker='*', label='optimal k') # index via ks, since multiples of 5 were removed
plt.xlabel('Number of Neighbors K')
plt.ylabel('Misclassification Error (MSE)')
plt.legend()
plt.show()
The optimal number of neighbors is 8
In [ ]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state=5)

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
(175, 11)
(175,)
(44, 11)
(44,)
In [ ]:
k = optimal_k
knn_model = KNeighborsClassifier(n_neighbors=k)

knn_model.fit(X_train, y_train)
Out[ ]:
KNeighborsClassifier(n_neighbors=8)
In [ ]:
y_pred = knn_model.predict(X_test)
In [ ]:
mat_test = confusion_matrix(y_test, y_pred)
print('confusion matrix = \n', mat_test)
confusion matrix = 
 [[ 6  0  0  0  0]
 [ 0  6  0  1  1]
 [ 0  0 11  0  0]
 [ 0  1  0  8  1]
 [ 0  4  0  2  3]]
In [ ]:
types = subsetsall['type'].unique()
types = list(types)

fig, ax2 = plt.subplots(1, 1, figsize=(4, 4))
cm = confusion_matrix(y_test, y_pred)
ax2 = sns.heatmap(cm, annot=True, square=True, xticklabels=types, yticklabels=types)
ax2.set_xlabel('Predicted Labels')
ax2.set_ylabel('Actual Labels')
Out[ ]:
Text(17.25, 0.5, 'Actual Labels')
In [ ]:
print(classification_report(y_test, y_pred, target_names=types))
              precision    recall  f1-score   support

          PO       1.00      1.00      1.00         6
          SL       0.55      0.75      0.63         8
          LP       1.00      1.00      1.00        11
          LF       0.73      0.80      0.76        10
          CR       0.60      0.33      0.43         9

    accuracy                           0.77        44
   macro avg       0.77      0.78      0.76        44
weighted avg       0.77      0.77      0.76        44

The overall accuracy is lower here than when using only 2 features and 3 weld defect types with k=5 neighbors. This can be expected: the model is more complex and has more possibly useless or even detrimental features to work with. (It is still a decent accuracy, and the k-fold cross-validation gives us confidence that the model is not overfitting.)

We can use this model for predicting the PO and LP weld defect types, as they have a perfect f1-score (at least in this random split of the data). This could perhaps have been predicted much earlier from these two types' density in the initial pairplot (most visibly in the diagonal ar and sp distributions, but they separate well elsewhere too).
More curated features (such as in our first KNN explored here) and perhaps fewer type possibilities (by first filtering out PO and LP, for example) would be better for predicting the other types.

Mainly, this exploration of the data (first with a curated selection, then the full set) shows the importance of tuning the model with the features that best separate the types. It is not always best to use all features and all types; you cannot just throw them all in and expect the best results. Multiple models should be built and used in conjunction to get the best predictions, both for this weld defect dataset and in general.
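That feature-curation point can be made concrete with scikit-learn's univariate feature selection; a minimal sketch on the bundled iris dataset (a stand-in, not the weld data):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Stand-in data: scikit-learn's bundled iris set (not the weld dataset)
X_demo, y_demo = load_iris(return_X_y=True)

# Rank features by ANOVA F-score and keep the 2 most discriminative ones
selector = SelectKBest(score_func=f_classif, k=2)
X_top2 = selector.fit_transform(X_demo, y_demo)

print('F-scores:', selector.scores_.round(1))
print('kept feature indices:', selector.get_support(indices=True))
print('reduced shape:', X_top2.shape)
```

A ranking like this can back up (or challenge) a visual pick of "strong" features from a pairplot before committing to a model.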

DT (Decision Tree) Classifier¶

While the k-nearest-neighbors approach produced good accuracy, another way to classify the data is with a decision tree. We will use the same two strong features (re and rc) and the same targets (SL, LP, and CR) as in our KNN approach.

In [ ]:
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_text, export_graphviz 
import IPython
from IPython.display import Image
import pydotplus
import graphviz

Let's prep our data for the DT classifier:

In [ ]:
X = []
y = []
for i in range(5):
    X.append(subsets[i][['re', 'rc']].values)
    y.append(subsets[i]['type_num'].values)
print(X[3].shape)
print(y[3].shape)

# Training Data - subsets 1, 2, 3, and 5 (80% of data)
X_train = np.concatenate([X[i] for i in [0, 1, 2, 4]], axis=0)
y_train = np.concatenate([y[i] for i in [0, 1, 2, 4]], axis=0)
print(X_train.shape)
print(y_train.shape)

# Testing Data - subset 4 (20% of data)
X_test = X[3]
y_test = y[3]
print(X_test.shape)
print(y_test.shape)
(27, 2)
(27,)
(108, 2)
(108,)
(27, 2)
(27,)

We will create two DT classifier models, and experiment with different values for the following 4 parameters:

In [ ]:
max_depths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
criterions = ['gini', 'entropy']
max_leaf_nodes = [2, 3, 4, 5, 6, 7, 8, 9, 10]
min_samples_leafs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

DT Model Discussion:¶

The indices into the parameter lists can be adjusted to change the values of the 4 parameters. After experimenting with a few different options, we settled on the following indices/parameters for our 2 models:

  • The first model, gini, will be optimized for run time and thus has small max_depth and max_leaf_node, and slightly larger min_samples_leaf.
  • The second, entropy, creates more balanced trees by default and we will push this further with a large max_depth and max_leaf_node, and very small min_samples_leaf value.
In [ ]:
threetypes = ['CR', 'LP', 'SL']

DTgini    = DecisionTreeClassifier(criterion=criterions[0], 
                                    max_depth=max_depths[0],
                                    max_leaf_nodes=max_leaf_nodes[1],
                                    min_samples_leaf=min_samples_leafs[5])

DTentropy = DecisionTreeClassifier(criterion=criterions[1], 
                                    max_depth=max_depths[8],
                                    max_leaf_nodes=max_leaf_nodes[8],
                                    min_samples_leaf=min_samples_leafs[0])

DTgini_model = DTgini.fit(X_train, y_train)
DTentropy_model = DTentropy.fit(X_train, y_train)
In [ ]:
fig, axes = plt.subplots(1, 2, figsize=(15, 8))  # adjust the size as needed

# Decision Tree 1
plot_tree(DTgini_model,
            feature_names=['re', 'rc'],
            class_names=threetypes,
            filled=True, ax=axes[0])
axes[0].set_title('Gini', fontsize=16)

# Decision Tree 2
plot_tree(DTentropy_model,
            feature_names=['re', 'rc'],
            class_names=threetypes,
            filled=True, ax=axes[1])
axes[1].set_title('Entropy', fontsize=16)

fig.suptitle('Decision Trees Comparison', fontsize=16)  # Setting overall title
plt.tight_layout()  # to prevent overlapping
plt.show()

The trees can also be rendered with export_graphviz and graphviz, like so:

In [ ]:
# Decision Tree 1
dot_data1 = export_graphviz(DTgini_model, 
                            out_file=None, 
                            feature_names=['re', 'rc'],  
                            class_names=True, 
                            filled=True)

graph1 = graphviz.Source(dot_data1) 
graph1
Out[ ]:
[Graphviz rendering of the gini tree: the root splits on re <= 0.315 (gini = 0.658, 108 samples, value = [28, 40, 40]); the left child splits on rc <= 0.248 into a pure LP leaf (38 samples) and a mostly-CR leaf (28 CR, 2 LP, gini = 0.124); the right child is a pure SL leaf (40 samples).]
In [ ]:
# Decision Tree 2
dot_data2 = export_graphviz(DTentropy_model, 
                            out_file=None, 
                            feature_names=['re', 'rc'],  
                            class_names=True, 
                            filled=True)

graph2 = graphviz.Source(dot_data2) 
graph2
Out[ ]:
[Graphviz rendering of the entropy tree: the root splits on re <= 0.315 (entropy = 1.566, 108 samples, value = [28, 40, 40]); the right child is a pure SL leaf (40 samples); the left child splits on rc <= 0.248 into a pure LP leaf (38 samples) and a branch that further splits on re <= 0.031, re <= 0.227, and rc <= 0.442 to peel off single LP points from the 28 CR points.]

We can also print these trees in text like so:

In [ ]:
# Use export_text to create text reports of the decision trees
tree_rules_gini = export_text(DTgini_model, feature_names=['re', 'rc'])
tree_rules_entropy = export_text(DTentropy_model, feature_names=['re', 'rc'])

print("Decision tree rules for Gini model:\n", tree_rules_gini)
print("Decision tree rules for Entropy model:\n", tree_rules_entropy)
Decision tree rules for Gini model:
 |--- re <= 0.32
|   |--- rc <= 0.25
|   |   |--- class: 1
|   |--- rc >  0.25
|   |   |--- class: 0
|--- re >  0.32
|   |--- class: 2

Decision tree rules for Entropy model:
 |--- re <= 0.32
|   |--- rc <= 0.25
|   |   |--- class: 1
|   |--- rc >  0.25
|   |   |--- re <= 0.03
|   |   |   |--- class: 1
|   |   |--- re >  0.03
|   |   |   |--- re <= 0.23
|   |   |   |   |--- class: 0
|   |   |   |--- re >  0.23
|   |   |   |   |--- rc <= 0.44
|   |   |   |   |   |--- class: 1
|   |   |   |   |--- rc >  0.44
|   |   |   |   |   |--- class: 0
|--- re >  0.32
|   |--- class: 2

We can also plot the decision boundary of the 2 models:

In [ ]:
zoom_parameter = 0.2

#---min and max for the first feature---
x_min, x_max = X_train[:, 0].min() - zoom_parameter, X_train[:, 0].max() + zoom_parameter

#---min and max for the second feature---
y_min, y_max = X_train[:, 1].min() - zoom_parameter, X_train[:, 1].max() + zoom_parameter

#---step size in the mesh---
x_step = (x_max - x_min) / 100
y_step = (y_max - y_min) / 100

#---make predictions for each of the points in xx,yy---
xx, yy = np.meshgrid(np.arange(x_min, x_max, x_step), np.arange(y_min, y_max, y_step))

Z = DTgini_model.predict(np.c_[xx.ravel(), yy.ravel()])

#---draw the result using a color plot---
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Accent, alpha=0.8)

#---plot the training points---
colors = ['red', 'green', 'blue']
types = ['CR', 'LP', 'SL']
for color, i, target in zip(colors, [0, 1, 2], types):
    plt.scatter(X_train[y_train==i, 0], X_train[y_train==i, 1], color=color, label=target)

plt.xlabel('Roughness of Defect Edge (re)')
plt.ylabel('Roughness Contrast (rc)')
plt.title(f'Decision Surface for DT model with: \n criterion=gini \n max_depth={max_depths[0]} \n min_samples_leaf={min_samples_leafs[5]} \n max_leaf_nodes={max_leaf_nodes[1]}') # indices match the fitted DTgini model
plt.legend(loc='best', shadow=False, scatterpoints=1)
Out[ ]:
<matplotlib.legend.Legend at 0x12f781a58d0>
In [ ]:
Z = DTentropy_model.predict(np.c_[xx.ravel(), yy.ravel()])

#---draw the result using a color plot---
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Accent, alpha=0.8)

#---plot the training points---
colors = ['red', 'green', 'blue']
types = ['CR', 'LP', 'SL']
for color, i, target in zip(colors, [0, 1, 2], types):
    plt.scatter(X_train[y_train==i, 0], X_train[y_train==i, 1], color=color, label=target)

plt.xlabel('Roughness of Defect Edge (re)')
plt.ylabel('Roughness Contrast (rc)')
plt.title(f'Decision Surface for DT model with: \n criterion=entropy \n max_depth={max_depths[8]} \n min_samples_leaf={min_samples_leafs[0]} \n max_leaf_nodes={max_leaf_nodes[8]}') # indices match the fitted DTentropy model
plt.legend(loc='best', shadow=False, scatterpoints=1)
Out[ ]:
<matplotlib.legend.Legend at 0x12f77e75f00>

DT Model Discussion Extended:¶

  • The first model does a good job of splitting the data into the 3 types, and does so with essentially only 2 straight lines. It misclassifies only 2 training points (2 LP points classified as CR).

  • The second model does an even better job of splitting the training data into the 3 types, but it needs more decision boundaries to do so. These extra boundaries come mainly from the min_samples_leaf parameter being set very low (literally 1 here). Note that this is on the training data; this model is likely overfitting. We will see how it performs on the testing data.

If we had to choose now, the first model would be the better choice. It is simpler and likely does better on unseen data.
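The "simpler tree generalizes better" intuition can be checked by sweeping max_depth and comparing train vs. test scores; a sketch on scikit-learn's bundled wine dataset (a stand-in for the weld data; the 80/20 split and random states are arbitrary choices):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: scikit-learn's bundled wine set (not the weld dataset)
X_demo, y_demo = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=0)

# Deeper trees fit the training set better but may not generalize better
for depth in [2, 4, 8, None]:
    dt = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f'max_depth={depth}: '
          f'train={dt.score(X_tr, y_tr):.2f}, test={dt.score(X_te, y_te):.2f}')
```

The unconstrained tree reaches a perfect training score; whether the test score keeps pace is what reveals overfitting.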

DT Model Performance:¶

We used subsets 1, 2, 3, and 5 as our training data. Thus, using the fourth subset as our testing data, we can see how the two models perform:

In [ ]:
print(f'DT1 training score: {DTgini_model.score(X_train, y_train)}')
print(f'DT2 training score: {DTentropy_model.score(X_train, y_train)}')
print(f'DT1 testing  score: {DTgini_model.score(X_test, y_test)}')
print(f'DT2 testing  score: {DTentropy_model.score(X_test, y_test)}')
DT1 training score: 0.9814814814814815
DT2 training score: 1.0
DT1 testing  score: 0.9629629629629629
DT2 testing  score: 0.9259259259259259

We can see that while the first model performs worse on the training data, it performs better on the testing data.
Our prediction was correct (at least for this case): the simpler model achieved a better score on unseen data because it did not overfit the training data.

Predict Weld Defect Types on testing data:

In [ ]:
y_pred_proba_DT1 = DTgini_model.predict_proba(X_test)
print(f'DT1 prediction probabilities: \n {y_pred_proba_DT1}')

y_pred_proba_DT2 = DTentropy_model.predict_proba(X_test)
print(f'DT2 prediction probabilities: \n {y_pred_proba_DT2}')
DT1 prediction probabilities: 
 [[0.         0.         1.        ]
 [0.         0.         1.        ]
 [0.         0.         1.        ]
 [0.         0.         1.        ]
 [0.         0.         1.        ]
 [0.         0.         1.        ]
 [0.         0.         1.        ]
 [0.         0.         1.        ]
 [0.         0.         1.        ]
 [0.         0.         1.        ]
 [0.         1.         0.        ]
 [0.         1.         0.        ]
 [0.93333333 0.06666667 0.        ]
 [0.         1.         0.        ]
 [0.         1.         0.        ]
 [0.         1.         0.        ]
 [0.         1.         0.        ]
 [0.         1.         0.        ]
 [0.         1.         0.        ]
 [0.         1.         0.        ]
 [0.93333333 0.06666667 0.        ]
 [0.93333333 0.06666667 0.        ]
 [0.93333333 0.06666667 0.        ]
 [0.93333333 0.06666667 0.        ]
 [0.93333333 0.06666667 0.        ]
 [0.93333333 0.06666667 0.        ]
 [0.93333333 0.06666667 0.        ]]
DT2 prediction probabilities: 
 [[0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 1. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]]

Make actual predictions on the testing data:

In [ ]:
print(f'   y_test: {y_test}')

y_pred_DT1 = DTgini_model.predict(X_test)
print(f'DT1 preds: {y_pred_DT1}')

y_pred_DT2 = DTentropy_model.predict(X_test)
print(f'DT2 preds: {y_pred_DT2}')
   y_test: [2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0]
DT1 preds: [2 2 2 2 2 2 2 2 2 2 1 1 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0]
DT2 preds: [2 2 2 2 2 2 2 2 2 2 1 1 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0]

Create confusion matrices and classification reports from the predictions:

In [ ]:
mat_test1 = confusion_matrix(y_test, y_pred_DT1)
print('confusion matrix DT1 = \n', mat_test1)

mat_test2 = confusion_matrix(y_test, y_pred_DT2)
print('confusion matrix DT2 = \n', mat_test2)
confusion matrix DT1 = 
 [[ 7  0  0]
 [ 1  9  0]
 [ 0  0 10]]
confusion matrix DT2 = 
 [[ 6  1  0]
 [ 1  9  0]
 [ 0  0 10]]
In [ ]:
fig, ax3 = plt.subplots(1, 1, figsize=(4, 4))
cm3 = confusion_matrix(y_test, y_pred_DT1)
ax3 = sns.heatmap(cm3, annot=True, square=True, xticklabels=threetypes, yticklabels=threetypes)
ax3.set_xlabel('Predicted Labels')
ax3.set_ylabel('Actual Labels')
ax3.set_title('DT1 Confusion Matrix')

print(classification_report(y_test, y_pred_DT1, target_names=threetypes))
              precision    recall  f1-score   support

          CR       0.88      1.00      0.93         7
          LP       1.00      0.90      0.95        10
          SL       1.00      1.00      1.00        10

    accuracy                           0.96        27
   macro avg       0.96      0.97      0.96        27
weighted avg       0.97      0.96      0.96        27

In [ ]:
fig, ax4 = plt.subplots(1, 1, figsize=(4, 4))
cm4 = confusion_matrix(y_test, y_pred_DT2)
ax4 = sns.heatmap(cm4, annot=True, square=True, xticklabels=threetypes, yticklabels=threetypes)
ax4.set_xlabel('Predicted Labels')
ax4.set_ylabel('Actual Labels')
ax4.set_title('DT2 Confusion Matrix')

print(classification_report(y_test, y_pred_DT2, target_names=threetypes))
              precision    recall  f1-score   support

          CR       0.86      0.86      0.86         7
          LP       0.90      0.90      0.90        10
          SL       1.00      1.00      1.00        10

    accuracy                           0.93        27
   macro avg       0.92      0.92      0.92        27
weighted avg       0.93      0.93      0.93        27

For this case, the misclassification error can be calculated as simply 1 - accuracy. Cross-validation will be done later.

In [ ]:
MSE_DT1 = 1 - DTgini_model.score(X_test, y_test)
print(f'DT1 MSE: {MSE_DT1}')
MSE_DT2 = 1 - DTentropy_model.score(X_test, y_test)
print(f'DT2 MSE: {MSE_DT2}')
DT1 MSE: 0.03703703703703709
DT2 MSE: 0.07407407407407407

As previously discussed, our first model is better. It has a higher accuracy and lower misclassification error (1 misclassified point [3.7%] in this case, compared to 2 points [7.4%] for our second model).

DT Model Pruning and Overfitting Protection¶

First and foremost, we will use a combined dataset for this work. We will also use all features and target defect cases:

In [ ]:
subsetsall
Out[ ]:
type w ar sp re rr sk ku hc rc sc kc PO SL LP LF CR type_num
0 PO 0.008596 0.006897 0.5748 0.838397 0.998562 0.091802 0.908459 0.003151 0.111302 0.256742 0.389952 1 0 0 0 0 3
1 PO 0.010029 0.003448 0.4112 0.838397 0.649317 0.039172 0.476520 0.002817 0.121299 0.332611 0.443785 1 0 0 0 0 3
2 PO 0.007163 0.003448 0.4400 1.007173 0.754309 0.048079 0.766430 0.002621 0.127759 0.323068 0.444515 1 0 0 0 0 3
3 PO 0.028653 0.003448 0.3124 0.534599 0.061617 0.244800 0.789110 0.010007 0.092632 0.220312 0.339685 1 0 0 0 0 3
4 PO 0.018625 0.003448 0.4024 0.557089 0.037346 0.578774 0.630554 0.006757 0.073914 0.270908 0.273045 1 0 0 0 0 3
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
214 CR 0.277937 0.949262 1.0268 0.102869 0.723013 0.025025 0.468658 0.101296 0.757683 0.231426 0.516244 0 0 0 0 1 0
215 CR 0.148997 0.720690 0.8172 0.055527 0.509504 0.135456 0.551284 0.010890 0.262126 0.410800 0.530843 0 0 0 0 1 0
216 CR 0.320917 0.846359 0.7100 0.106793 0.407912 0.027538 0.488077 0.191586 0.757547 0.158517 0.559012 0 0 0 0 1 0
217 CR 0.322350 0.578386 0.6420 0.143629 0.384393 0.039732 0.492730 0.154902 0.640716 0.218541 0.567931 0 0 0 0 1 0
218 CR 0.372493 0.799686 0.8580 0.167046 0.235256 0.075930 0.558360 0.268964 0.637409 0.164191 0.586349 0 0 0 0 1 0

219 rows × 18 columns

In [ ]:
types = subsetsall['type'].unique()
types = list(types)
types
Out[ ]:
['PO', 'SL', 'LP', 'LF', 'CR']
In [ ]:
X, y = subsetsall[nums].values, subsetsall['type'].values
print(X.shape)
print(y.shape)
(219, 11)
(219,)

We will perform a GridSearchCV cross-validation technique with K = 10 (ten) folds, to determine the optimal value for the max depth hyperparameter of the DecisionTreeClassifier class. We will increase the max depth from 1 to 20:

In [ ]:
from sklearn.model_selection import GridSearchCV

# Split the dataset into a training set and a test set
# using the same random state as for the KNN model before, in case we want to do any comparisons later
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5) 

# Define the model
dtc = DecisionTreeClassifier(random_state=40) # use random state parameter for control

# Define the parameter values that should be searched
param_grid = {'max_depth': list(range(1, 21))}

# Instantiate the grid
grid = GridSearchCV(dtc, param_grid, cv=10, scoring='accuracy')

# Fit the grid with data
grid.fit(X_train, y_train)

# View the complete results
# print(grid.cv_results_)

# Examine the best model
print("Best score: ", grid.best_score_)
print("Best param: ", grid.best_params_)
Best score:  0.8516339869281045
Best param:  {'max_depth': 4}
In [ ]:
import numpy as np

train_scores = grid.cv_results_['mean_test_score']

test_scores = []
max_depth_range = list(range(1, 21))

# Set the number of random state iterations
n_random_states = 100

for depth in max_depth_range:
    test_accuracy_list = []
    for i in range(n_random_states):
        # Initialize the DecisionTreeClassifier with different random states
        dtc = DecisionTreeClassifier(max_depth=depth, random_state=i)
        dtc.fit(X_train, y_train)
    
        # Predict and calculate accuracy for test set
        test_pred = dtc.predict(X_test)
        test_accuracy = accuracy_score(y_test, test_pred)
        test_accuracy_list.append(test_accuracy)
    
    # Append the mean test accuracy for this depth
    test_scores.append(np.mean(test_accuracy_list))

plt.figure(figsize=(10, 5))
plt.plot(max_depth_range, train_scores, label='Mean CV Accuracy on Training Set')
plt.plot(max_depth_range, test_scores, label='Mean Accuracy on Test Set (over Random States)')
plt.xlabel('Max Depth')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

We can see that the optimal value for the max depth hyperparameter is 4. (This has also been attempted manually over different DTClassifier random states.)

However, when plotting the mean testing accuracy (averaged over random states of the DTClassifier, not of train_test_split), we see that a max_depth of 4 does not give the highest accuracy.

If we were to use a max_depth of 4, our final model would underfit the data. We will use a max_depth of 8 instead: it gives the highest accuracy on the testing data, and nearly the highest on the training data, without going so deep as to risk overfitting. This could be investigated further, but a max_depth of 8 is a reasonable value for now.
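Besides limiting max_depth, scikit-learn's DecisionTreeClassifier also supports cost-complexity pruning via its ccp_alpha parameter, which is another way to guard against overfitting. A minimal sketch on synthetic stand-in data (make_classification here is a placeholder, not our weld dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the weld defect dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=11,
                                     n_informative=5, n_classes=3,
                                     random_state=40)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2,
                                      random_state=5)

# Compute the pruning path: the candidate ccp_alpha values for this data
path = DecisionTreeClassifier(random_state=40).cost_complexity_pruning_path(Xtr, ytr)

# Fit one tree per alpha and track its test accuracy
scores = []
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=40, ccp_alpha=alpha).fit(Xtr, ytr)
    scores.append(tree.score(Xte, yte))

best_alpha = path.ccp_alphas[int(np.argmax(scores))]
print(f'best ccp_alpha: {best_alpha:.4f}')
```

In practice ccp_alpha could be added to the GridSearchCV parameter grid alongside max_depth.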

Optimized Model:

In [ ]:
# using same random state as when searching for max_depth
# leaving default for all other parameters as was done when searching for max_depth as well
dtc_all = DecisionTreeClassifier(max_depth=8, random_state=40) 
dtc_all_model = dtc_all.fit(X_train, y_train)
In [ ]:
# Just out of curiosity, let's see what the tree looks like
plot_tree(dtc_all_model)
plt.show()
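If the plotted tree is hard to read, sklearn.tree.export_text gives the same split rules as indented text. A small sketch on the iris data (a stand-in, since our fitted model lives in the notebook session):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=40).fit(iris.data, iris.target)

# Render the learned split rules as indented text
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

For our model, passing feature_names=list(subsetsall[nums].columns) would label the splits with the weld feature names.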

Optimized Model Performance:

In [ ]:
print(f'DT training score: {dtc_all_model.score(X_train, y_train)}')
print(f'DT testing  score: {dtc_all_model.score(X_test, y_test)}')
DT training score: 1.0
DT testing  score: 0.9090909090909091

Make actual predictions on the testing data:

In [ ]:
y_pred_dtc_all = dtc_all_model.predict(X_test)
print(f'DT preds: {y_pred_dtc_all}')
DT preds: ['LP' 'SL' 'CR' 'SL' 'LP' 'SL' 'LP' 'LP' 'PO' 'LP' 'PO' 'LP' 'PO' 'LF'
 'LF' 'SL' 'LF' 'CR' 'PO' 'LP' 'SL' 'LP' 'SL' 'CR' 'CR' 'PO' 'LF' 'SL'
 'LF' 'PO' 'PO' 'LF' 'PO' 'LP' 'LF' 'LF' 'PO' 'SL' 'CR' 'LP' 'SL' 'LP'
 'SL' 'CR']

Create a Confusion Matrix and Classification Report from the predictions:

In [ ]:
mat_test_dtc_all = confusion_matrix(y_test, y_pred_dtc_all) 
print('confusion matrix = \n', mat_test_dtc_all)
confusion matrix = 
 [[ 6  0  0  0  0]
 [ 0  6  0  0  2]
 [ 0  0 11  0  0]
 [ 0  1  0  9  0]
 [ 0  1  0  0  8]]
In [ ]:
fig, ax5 = plt.subplots(1, 1, figsize=(4, 4))
cm5 = confusion_matrix(y_test, y_pred_dtc_all)
ax5 = sns.heatmap(cm5, annot=True, square=True, xticklabels=types, yticklabels=types)
ax5.set_xlabel('Predicted Labels')
ax5.set_ylabel('Actual Labels')
ax5.set_title('Optimized DT Confusion Matrix')

print(classification_report(y_test, y_pred_dtc_all, target_names=types))
              precision    recall  f1-score   support

          PO       1.00      1.00      1.00         6
          SL       0.75      0.75      0.75         8
          LP       1.00      1.00      1.00        11
          LF       1.00      0.90      0.95        10
          CR       0.80      0.89      0.84         9

    accuracy                           0.91        44
   macro avg       0.91      0.91      0.91        44
weighted avg       0.91      0.91      0.91        44

Misclassification Error (only the simple 1 − accuracy for this optimized model, not a full k-fold cross-validated error):

In [ ]:
MSE = 1 - dtc_all_model.score(X_test, y_test)
print(f'DT MSE: {MSE*100:.2f}%')
DT MSE: 9.09%
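A k-fold cross-validated version of this error can be obtained with cross_val_score; a sketch on synthetic stand-in data (make_classification is a placeholder for our X and y):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in data shaped like ours: 219 samples, 11 features, 5 classes
X_demo, y_demo = make_classification(n_samples=219, n_features=11,
                                     n_informative=5, n_classes=5,
                                     random_state=40)

dtc = DecisionTreeClassifier(max_depth=8, random_state=40)
# 10-fold CV accuracy; misclassification error is its complement
cv_acc = cross_val_score(dtc, X_demo, y_demo, cv=10, scoring='accuracy')
cv_mse = 1 - cv_acc.mean()
print(f'10-fold CV misclassification error: {cv_mse*100:.2f}%')
```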

Final Performance Comparison / Discussion¶

The overall performance of the model is good, with a weighted-average F1 score of 0.91. Additionally, it does not appear to be overfitting.

However, we must remember that the ultimate goal of this model is to classify the weld defect types correctly; that is likely the primary concern of the end customer. For some weld defect types, this one optimized model actually performs worse than the smaller-subset model we created earlier, which used only 2 features and 3 weld defect types. Type SL is the worst offender: in our "optimized" model it scores 75%, while in the smaller-subset model it scored 100%. This is a large difference. LP, on the other hand, does score higher in our optimized model than in the smaller-subset model, and CR has nearly equivalent accuracy in both.

In conclusion, this exploration of the data (first with a curated selection, then the full set) shows the importance of tuning the model with the best features to produce the best type splits. It is not always best to use all features and all types; you cannot simply throw them all in and expect the best results. Multiple models should be built and used in conjunction with each other to get the best prediction results, both on this weld defect dataset and in general.
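One standard way to use multiple models in conjunction is an ensemble vote. A minimal sketch on stand-in data (the model choices and hyperparameters here are illustrative, not tuned):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the weld defect dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=11,
                                     n_informative=5, n_classes=3,
                                     random_state=40)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2,
                                      random_state=5)

# Majority vote between a decision tree and a KNN model
vote = VotingClassifier([('dt', DecisionTreeClassifier(max_depth=8, random_state=40)),
                         ('knn', KNeighborsClassifier(n_neighbors=5))],
                        voting='hard')
vote.fit(Xtr, ytr)
print(f'ensemble accuracy: {vote.score(Xte, yte):.3f}')
```

The same pattern would combine our KNN and DT models on the weld data.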

Additionally, feature importance can be evaluated to help with single models (or multiple). We will do this now.

Feature Importance¶

We can also look at the feature importance of our optimized model:

In [ ]:
importance = pd.DataFrame({'feature': subsetsall[nums].columns,
                            'importance' : np.round(dtc_all_model.feature_importances_, 3)})

importance.sort_values('importance', ascending=False, inplace = True)
importance
Out[ ]:
feature importance
3 re 0.295
1 ar 0.266
8 rc 0.199
0 w 0.051
5 sk 0.047
4 rr 0.045
2 sp 0.034
7 hc 0.030
9 sc 0.014
10 kc 0.013
6 ku 0.007
In [ ]:
# ensure decision tree and importance calculations are using the same number of features
dtc_all_model.n_features_in_
Out[ ]:
11
In [ ]:
ser = pd.Series(importance.importance)
ser.index = importance.feature
ser
Out[ ]:
feature
re    0.295
ar    0.266
rc    0.199
w     0.051
sk    0.047
rr    0.045
sp    0.034
hc    0.030
sc    0.014
kc    0.013
ku    0.007
Name: importance, dtype: float64
In [ ]:
fig, ax6 = plt.subplots(figsize=(7, 3))
ser.plot(kind='barh', ax=ax6)
ax6.set_title("Feature importances using MDI")
ax6.set_ylabel("Mean decrease in impurity")
fig.tight_layout()

As predicted early in the data exploration through ordinary scatter plots (via pair plot) and heat maps, rc and re both have high feature importance. Additionally, ar and w were 2 of the 3 in our list of other contenders (see the cell [22] description), so their prominent feature importance is no surprise either. We read the data effectively from the start.

However, there were two surprises. The larger one was that sp has a much lower importance than initially thought. Looking back, we can see that its importance was overestimated because of its position on the heatmap (cell [22]) next to other, genuinely important features; human visual bias drove this incorrect prediction. The second surprise was how much lower the importance of w was than that of the other 3 most important features. Again, this was human error: we did not study the heatmap and other data long enough, or do enough quick mental estimation, to predict it. This makes the feature importance tool, which computes these values precisely for us, all the more valuable.
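Because MDI importances can be biased (e.g. toward high-variance or high-cardinality features), permutation importance on held-out data is a common cross-check; a sketch on stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the weld defect dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=11,
                                     n_informative=5, random_state=40)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2,
                                      random_state=5)

model = DecisionTreeClassifier(max_depth=8, random_state=40).fit(Xtr, ytr)
# Shuffle each feature on the test set and measure the accuracy drop
result = permutation_importance(model, Xte, yte, n_repeats=20, random_state=40)
print(result.importances_mean)
```

Run on our own dtc_all_model, X_test, and y_test, this would confirm (or challenge) the MDI ranking above.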

Feature Pruning / Future Work:¶

To continue with feature pruning, we can look at features with low variance and remove them from the model, or do other feature engineering. It can begin as follows:

In [ ]:
from sklearn import feature_selection

mic = feature_selection.mutual_info_classif(X_train, y_train)

fig, ax7 = plt.subplots(figsize=(7, 3))
(
pd.DataFrame(
{"feature": subsetsall[nums].columns, "vimp": mic}
)
.set_index("feature")
.plot.barh(ax=ax7)
)

ax7.set_title("Feature Selection with Mutual Information")
Out[ ]:
Text(0.5, 1.0, 'Feature Selection with Mutual Information')
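The low-variance removal mentioned above can be done directly with scikit-learn's VarianceThreshold. A sketch on a stand-in matrix (the threshold value is illustrative and would need tuning for our scaled weld features):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Stand-in feature matrix: the last column is nearly constant
rng = np.random.default_rng(40)
X_demo = rng.random((100, 3))
X_demo[:, 2] = 0.5 + 1e-6 * rng.random(100)

# Drop any feature whose variance falls below the threshold
selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X_demo)
print(X_reduced.shape)  # the near-constant column is removed
```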